Abstract. Speech recognition has recently become a practical technology
for real-world applications. Aiming at speech-driven text retrieval,
which facilitates retrieving information with spoken queries, we propose 
a method to integrate speech recognition and retrieval methods. Since 
users speak contents related to a target collection, we adapt statistical 
language models used for speech recognition based on the target collec- 
tion, so as to improve both the recognition and retrieval accuracy. Ex- 
periments using existing test collections combined with dictated queries 
showed the effectiveness of our method. 



1 Introduction 

Automatic speech recognition, which decodes human voice to generate tran- 
scriptions, has of late become a practical technology. It is feasible that speech 
recognition is used in real world computer-based applications, specifically, those 
associated with human language. In fact, a number of speech-based methods have 
been explored in the information retrieval community, which can be classified 
into the following two fundamental categories: 

— spoken document retrieval, in which written queries are used to search speech 

(e.g., broadcast news audio) archives for relevant speech information,

— speech-driven (spoken query) retrieval, in which spoken queries are used to 
retrieve relevant textual information.

Initiated partially by the TREC-6 spoken document retrieval (SDR) track,
various methods have been proposed for spoken document retrieval. However, a 
relatively small number of methods have been explored for speech-driven text 
retrieval, although they are associated with numerous keyboard-less retrieval 
applications, such as telephone-based retrieval, car navigation systems, and user- 
friendly interfaces.

Barnett et al. performed comparative experiments related to speech-driven
retrieval, where an existing speech recognition system was used as an input inter- 
face for the INQUERY text retrieval system. They used as test inputs 35 queries 
collected from the TREC 101-135 topics, dictated by a single male speaker. 
Crestani also used the above 35 queries and showed that conventional relevance
feedback techniques marginally improved the accuracy for speech-driven
text retrieval. 

The above studies focused solely on improving text retrieval methods and did
not address problems of improving speech recognition accuracy. In fact, an ex- 
isting speech recognition system was used with no enhancement. In other words, 
speech recognition and text retrieval modules were fundamentally independent 
and were simply connected by way of an input/output protocol. 

However, since most speech recognition systems are trained based on specific 
domains, the accuracy of speech recognition across domains is not satisfactory. 
Thus, as can easily be predicted, in the cases of Barnett et al. and Crestani, a
relatively high speech recognition error rate considerably decreased the retrieval 
accuracy. Additionally, speech recognition with a high accuracy is crucial for 
interactive retrieval. 

Motivated by these problems, in this paper we integrate (not simply connect) 
speech recognition and text retrieval to improve both recognition and retrieval 
accuracy in the context of speech-driven text retrieval. 

Unlike general-purpose speech recognition aimed to decode any spontaneous 
speech, in the case of speech-driven text retrieval, users usually speak contents 
associated with a target collection, from which documents relevant to their in- 
formation need are retrieved. In a stochastic speech recognition framework, the 
accuracy depends primarily on acoustic and language models. While acoustic
models are related to phonetic properties, language models, which represent the
linguistic contents to be spoken, are strongly related to target collections.
Thus, it is reasonable to produce language models based on target
collections.

To sum up, our belief is that by adapting a language model based on a target 
IR collection, we can improve the speech recognition and text retrieval accuracy, 
simultaneously. 

Section 2 describes our prototype speech-driven text retrieval system, which
is currently implemented for Japanese. Section 3 elaborates on comparative
experiments, in which existing test collections for Japanese text retrieval are used
to evaluate the effectiveness of our system. 

2 System Description 
2.1 Overview 

Figure 1 depicts the overall design of our speech-driven text retrieval system,
which consists of speech recognition, text retrieval and adaptation modules. We 
explain the retrieval process based on this figure.

In the off-line process, the adaptation module uses the entire target collection
(from which relevant documents are retrieved) to produce a language model, so 
that user speech related to the collection can be recognized with a high accuracy. 
On the other hand, an acoustic model is produced independent of the target 
collection. 

In the on-line process, given an information need spoken by a user, the speech 
recognition module uses the acoustic and language models to generate a tran- 
scription for the user speech. Then, the text retrieval module searches the collec- 
tion for documents relevant to the transcription, and outputs a specific number 
of top-ranked documents according to the degree of relevance, in descending 
order. 

These documents are fundamentally final outputs. However, in the case where 
the target collection consists of multiple domains, a language model produced in 
the off-line adaptation process is not necessarily precisely adapted to a specific 
information need. Thus, we optionally use top-ranked documents obtained in 
the initial retrieval process for an on-line adaptation, because these documents 
are associated with the user speech more than the entire collection. We then re- 
perform speech recognition and text retrieval processes to obtain final outputs. 

In other words, our system is based on the two-stage retrieval principle,
where top-ranked documents retrieved in the first stage are intermediate results, 
and are used to improve the accuracy for the second (final) stage. From a different 
perspective, while the off-line adaptation process produces the global language 
model for a target collection, the on-line adaptation process produces a local 
language model based on the user speech. 
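
The two-stage flow described above can be sketched in a few lines. This is only an illustrative sketch, not the implemented system: the function names (`two_stage_retrieval`, `recognize`, `retrieve`, `adapt`), the toy documents, and the count-based "language model" are all hypothetical stand-ins chosen so that the example runs on its own.

```python
from collections import Counter

def two_stage_retrieval(speech, recognize, retrieve, adapt, global_lm, top_k=2):
    """Recognize and retrieve once, adapt the model on the top-ranked
    documents, then recognize and retrieve again."""
    first_transcript = recognize(speech, global_lm)
    first_ranking = retrieve(first_transcript)
    local_lm = adapt(global_lm, first_ranking[:top_k])
    final_transcript = recognize(speech, local_lm)
    return final_transcript, retrieve(final_transcript)

# Toy stand-ins (hypothetical): documents are bags of words, the "language
# model" is a table of word counts, and "speech" is a list of candidate
# words per position.
DOCS = {
    "d1": ["speech", "retrieval", "system"],
    "d2": ["peach", "farming"],
    "d3": ["text", "retrieval", "system"],
}

def recognize(speech, lm):
    # Pick, for each position, the candidate with the highest model count.
    return [max(candidates, key=lambda w: lm[w]) for candidates in speech]

def retrieve(transcript):
    # Rank documents by the number of distinct shared words (ties by ID).
    def overlap(doc_id):
        return len(set(transcript) & set(DOCS[doc_id]))
    return sorted(DOCS, key=lambda d: (-overlap(d), d))

def adapt(lm, doc_ids, weight=2):
    # Boost the counts of words occurring in the top-ranked documents.
    adapted = Counter(lm)
    for doc_id in doc_ids:
        for word in DOCS[doc_id]:
            adapted[word] += weight
    return adapted

global_lm = Counter({"peach": 3, "speech": 2, "retrieval": 1, "system": 1})
speech = [["peach", "speech"], ["retrieval"], ["system"]]

final_transcript, final_ranking = two_stage_retrieval(
    speech, recognize, retrieve, adapt, global_lm)
print(final_transcript, final_ranking)
```

In this toy setting the first pass misrecognizes the first word as "peach", but the documents it retrieves still shift the adapted counts toward "speech", so the second pass recovers the intended query.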

In the following sections, we explain speech recognition, adaptation, and text 
retrieval modules in Figure 1, respectively.



2.2 Speech Recognition 

The speech recognition module generates word sequence W, given phoneme
sequence X. In the stochastic speech recognition framework, the task is to output
the W maximizing P(W|X), which is transformed as in equation (1) through
use of Bayes' theorem.

\[ \arg\max_{W} P(W|X) = \arg\max_{W} P(X|W) \cdot P(W) \qquad (1) \]

Here, P(X|W) models the probability that word sequence W is transformed into
phoneme sequence X, and P(W) models the probability that W is linguistically
acceptable. These factors are usually called acoustic and language models,
respectively.
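
As a rough illustration of equation (1), the following sketch selects, among a handful of candidate transcriptions, the one maximizing the sum of acoustic and language model log-probabilities. The function name `decode` and all scores are hypothetical values for illustration; a real decoder such as Julius searches a far larger hypothesis space.

```python
import math

def decode(hypotheses, acoustic_logprob, language_logprob):
    """Return the word sequence W maximizing log P(X|W) + log P(W)."""
    return max(hypotheses,
               key=lambda w: acoustic_logprob[w] + language_logprob[w])

# Hypothetical scores for two competing transcriptions of one utterance.
acoustic_logprob = {
    ("peach", "retrieval"): math.log(0.25),   # slightly better acoustic match
    ("speech", "retrieval"): math.log(0.20),
}
language_logprob = {
    ("peach", "retrieval"): math.log(0.001),  # unlikely under the language model
    ("speech", "retrieval"): math.log(0.010),
}

best = decode(list(acoustic_logprob), acoustic_logprob, language_logprob)
print(best)
```

Here the language model outweighs the small acoustic advantage of the wrong hypothesis, which is the effect that adapting P(W) to the target collection is meant to strengthen.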

For the speech recognition module, we use the Japanese dictation toolkit (see
footnote 1), which includes the "Julius" recognition engine and acoustic/language
models trained on newspaper articles. This toolkit also includes development
software, so that acoustic and language models can be produced and replaced
depending on the application. While we use the acoustic model provided in the
toolkit, we use new language models produced by way of the adaptation process
(see Section 2.3).

1 http://winnie.kuis.kyoto-u.ac.jp/dictation/

Fig. 1. The overall design of our speech-driven text retrieval system.



2.3 Language Model Adaptation 

The basis of the adaptation module is to produce a word-based N-gram (in our
case, a combination of bigram and trigram) model by way of source documents.

In the off-line (global) adaptation process, we use the ChaSen morphological
analyzer to extract words from the entire target collection, and produce the
global N-gram model.

On the other hand, in the on-line (local) adaptation process, only top-ranked
documents retrieved in the first stage are used as source documents, from which
word-based N-grams are extracted as performed in the off-line process.
However, unlike the case of the off-line process, we do not produce the entire
language model. Instead, we re-estimate only statistics associated with top-ranked
documents, for which we use the MAP (Maximum A-posteriori Probability)
estimation method.
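
To make the local re-estimation concrete, the sketch below counts bigrams in already tokenized documents and combines local counts with a global model. The exact MAP formulation is not given here, so the estimate used, (local count + tau x global probability) / (local history count + tau), is one common MAP-style form and should be read as an assumption, as should the names `bigram_counts`, `map_bigram_prob`, and the prior weight `tau`. Tokenization (ChaSen in our case) is assumed to have been done beforehand.

```python
from collections import Counter

def bigram_counts(docs):
    """Count bigrams (h, w) and history words h over tokenized documents."""
    pair, hist = Counter(), Counter()
    for tokens in docs:
        for h, w in zip(tokens, tokens[1:]):
            pair[(h, w)] += 1
            hist[h] += 1
    return pair, hist

def map_bigram_prob(h, w, local_pair, local_hist, global_prob, tau=10.0):
    """MAP-style estimate (assumed form): local counts smoothed toward the
    global model, with tau > 0 acting as the weight of the prior."""
    return (local_pair[(h, w)] + tau * global_prob(h, w)) / (local_hist[h] + tau)

# Global model from the whole (toy) collection, maximum-likelihood estimate.
global_docs = [["language", "model"], ["language", "processing"],
               ["speech", "recognition"]]
g_pair, g_hist = bigram_counts(global_docs)

def global_prob(h, w):
    return g_pair[(h, w)] / g_hist[h] if g_hist[h] else 0.0

# Local statistics from the (toy) top-ranked documents of the first stage.
top_docs = [["language", "model", "adaptation"], ["language", "model"]]
l_pair, l_hist = bigram_counts(top_docs)

adapted = map_bigram_prob("language", "model", l_pair, l_hist, global_prob, tau=2.0)
print(global_prob("language", "model"), adapted)
```

For histories that never occur in the top-ranked documents the estimate falls back to the global probability, which matches the idea of re-estimating only the statistics associated with those documents.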

Although the on-line adaptation theoretically improves the retrieval accuracy, 
for real-time usage, the trade-off between the retrieval accuracy and computa- 
tional time required for the on-line process has to be considered. 

Our method is similar to the one proposed by Seymore and Rosenfeld in 
the sense that both methods adapt language models based on a small number of 
documents related to a specific domain (or topic). However, unlike their method, 
our method does not require corpora manually annotated with topic tags.

2.4 Text Retrieval

The text retrieval module is based on an existing probabilistic retrieval method,
which computes the relevance score between the transcribed query and each
document in the collection. The relevance score for document i is computed based
on equation (2).

\[ \sum_{t} \left( \frac{TF_{t,i}}{\frac{DL_i}{avglen} + TF_{t,i}} \cdot \log \frac{N}{DF_t} \right) \qquad (2) \]

Here, t's denote terms in transcribed queries. TF_{t,i} denotes the frequency with
which term t appears in document i. DF_t and N denote the number of documents
containing term t and the total number of documents in the collection, respectively.
DL_i denotes the length of document i (i.e., the number of characters contained in i),
and avglen denotes the average length of documents in the collection.
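
Equation (2) can be computed directly, as in the following minimal sketch. The function name `relevance_score` and the toy statistics are hypothetical, and summing over distinct query terms (rather than repeating duplicated terms) is an assumption of this sketch.

```python
import math
from collections import Counter

def relevance_score(query_terms, doc_terms, doc_len, avglen, df, n_docs):
    """Score one document against a query following equation (2).

    doc_len is DL_i, avglen the average document length, df maps each term
    to DF_t, and n_docs is N.
    """
    tf = Counter(doc_terms)
    score = 0.0
    for t in set(query_terms):  # assumption: each distinct term counted once
        if tf[t] == 0 or df.get(t, 0) == 0:
            continue  # the term contributes nothing to this document
        score += tf[t] / (doc_len / avglen + tf[t]) * math.log(n_docs / df[t])
    return score

# Toy statistics (made up for illustration).
df = {"speech": 10, "retrieval": 50}
score = relevance_score(
    query_terms=["speech", "retrieval"],
    doc_terms=["speech", "retrieval", "speech"],
    doc_len=20, avglen=20, df=df, n_docs=100)
print(round(score, 4))
```

With DL_i equal to avglen, the term "speech" (TF = 2, DF = 10) contributes (2/3) log 10 and "retrieval" (TF = 1, DF = 50) contributes (1/2) log 2, so rarer terms that occur more often in the document dominate the score.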

We use content words extracted from documents as terms, and perform a 
word-based indexing. For this purpose, we use the ChaSen morphological
analyzer to extract content words. We extract terms from transcribed queries
using the same method. 



3 Experimentation 
3.1 Test Collections 

We investigated the performance of our system based on the NTCIR workshop 
evaluation methodology, which resembles the one in the TREC ad hoc retrieval 
track. In other words, each system outputs 1,000 top documents, and the TREC 
evaluation software was used to plot recall-precision curves and calculate non- 
interpolated average precision values. 
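
For reference, non-interpolated average precision for a single query can be sketched as follows. This is a simplified illustration of the standard definition, not the TREC evaluation software itself; the function name `average_precision` and the document IDs are hypothetical, and the ranked list is assumed to contain no duplicates.

```python
def average_precision(ranked_ids, relevant_ids):
    """Mean of the precision values at the ranks of retrieved relevant
    documents, divided by the total number of relevant documents."""
    relevant = set(relevant_ids)
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(relevant) if relevant else 0.0

# Relevant documents appear at ranks 2 and 4; one relevant document is missed.
ap = average_precision(["d3", "d1", "d7", "d2"], {"d1", "d2", "d9"})
print(ap)
```

Relevant documents that are never retrieved still count in the denominator, so missing them lowers the value.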

The NTCIR workshop was held twice (in 1999 and 2001), for which two
different test collections were produced: the NTCIR-1 and 2 collections (see
footnote 2). However, since these collections do not include spoken queries, we
asked four speakers (two males and two females) to dictate information needs in
the NTCIR collections, and simulated speech-driven text retrieval.

The NTCIR collections include documents collected from technical papers 
published by 65 Japanese associations for various fields. Each document consists 
of the document ID, title, name(s) of author(s), name/date of conference, hosting 
organization, abstract and author keywords, from which we used titles, abstracts 
and keywords for the indexing. The numbers of documents in the NTCIR-1 and
2 collections are 332,918 and 736,166, respectively (the NTCIR-1 documents are 
a subset of the NTCIR-2). 

The NTCIR-1 and 2 collections also include 53 and 49 topics, respectively.
Each topic consists of the topic ID, title of the topic, description, and narrative.
Figure 2 shows an English translation for a fragment of the NTCIR topics (see
footnote 3), where each field is tagged in an SGML form. In general, titles are not
informative for the retrieval. On the other hand, narratives, which usually consist
of several sentences, are too long to speak. Thus, only descriptions, which consist
of a single phrase or sentence, were dictated by each speaker, so as to produce
four different sets of 102 spoken queries.

2 http://research.nii.ac.jp/~ntcadm/index-en.html

3 The NTCIR-2 collection contains Japanese topics and their English translations.

<TOPIC q=0118>
<TITLE>TV conferencing</TITLE>
<DESCRIPTION>Distance education support systems using TV
conferencing</DESCRIPTION>
<NARRATIVE>A relevant document will provide information on
the development of distance education support systems using TV
conferencing. Preferred documents would present examples of using
TV conferencing and discuss the results. Any reported methods
of aiding remote teaching are relevant documents (for example,
ways of utilizing satellite communication, the Internet, and ISDN
circuits).</NARRATIVE>
</TOPIC>



Fig. 2. An English translation for an example topic in the NTCIR collections. 



In the NTCIR collections, relevance assessment was performed based on the 
pooling method. First, candidates for relevant documents were obtained with
multiple retrieval systems. Then, for each candidate document, human expert(s)
assigned one of three ranks of relevance: "relevant," "partially relevant" and 
"irrelevant." The NTCIR-2 collection also includes "highly relevant" documents. 
In our evaluation, "highly relevant" and "relevant" documents were regarded as 
relevant ones. 

3.2 Comparative Evaluation 

In order to investigate the effectiveness of the off-line language model adaptation, 
we compared the performance of the following different retrieval methods: 

— text-to-text retrieval, which used written descriptions as queries, and can be 
seen as the perfect speech-driven text retrieval, 

— speech-driven text retrieval, in which a language model produced based on 
the NTCIR-2 collection was used, 

— speech-driven text retrieval, in which a language model produced based on 
ten years worth of Mainichi Shimbun Japanese newspaper articles (1991- 
2000) was used. 

The only difference in producing the two language models (i.e., those based
on the NTCIR-2 collection and newspaper articles) is the source documents.
In other words, both language models have the same vocabulary size (20,000),
and were produced using the same software.

Table 1 shows statistics related to word tokens/types in the two source
corpora for language modeling, where the line "Coverage" denotes the ratio of
word tokens contained in the resultant language model. Most word tokens
were covered in both language models.

Table 1. Statistics associated with source words for language modeling.

               NTCIR    News
# of Types     454K     315K
# of Tokens    175M     262M
Coverage       97.9%    96.5%
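
The "Coverage" figure in Table 1 can be illustrated with a small sketch. How the 20,000-word vocabulary was selected is not stated here, so choosing the most frequent word types is an assumption of this sketch, as is the function name `vocabulary_coverage`.

```python
from collections import Counter

def vocabulary_coverage(tokens, vocab_size):
    """Return (coverage, number of types), where coverage is the ratio of
    word tokens belonging to the vocab_size most frequent word types."""
    counts = Counter(tokens)
    covered = sum(count for _, count in counts.most_common(vocab_size))
    coverage = covered / len(tokens) if tokens else 0.0
    return coverage, len(counts)

# Toy corpus: 7 tokens, 4 types; a 2-word vocabulary covers "a" and "b".
coverage, n_types = vocabulary_coverage("a a a b b c d".split(), vocab_size=2)
print(coverage, n_types)
```

Because word frequencies are highly skewed, a vocabulary much smaller than the number of types can still cover most tokens, as the values in Table 1 suggest.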



In cases of speech-driven text retrieval methods, queries dictated by four 
speakers were used individually. Thus, in practice we compared nine different 
retrieval methods. Although the Julius decoder outputs more than one tran- 
scription candidate for a single speech input, we used only the one with the 
greatest probability score. The results did not significantly change depending on 
whether or not we used lower-ranked transcriptions as queries. 

Table 2 shows the non-interpolated average precision values and word error
rate in speech recognition, for different retrieval methods. As with existing ex- 
periments for speech recognition, word error rate (WER) is the ratio between 
the number of word errors (i.e., deletion, insertion, and substitution) and the 
total number of words. In addition, we also investigated error rate with respect 
to query terms (i.e., keywords used for retrieval), which we shall call "term error 
rate (TER)." 
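
The word error rate defined above can be computed with a standard edit-distance alignment, as in the following sketch. Normalizing by the number of words in the reference transcription is the usual convention and is assumed here; the function name `word_error_rate` and the example sentences are illustrative only.

```python
def word_error_rate(reference, hypothesis):
    """(deletions + insertions + substitutions) / number of reference words,
    using the minimum-cost alignment between the two word sequences."""
    m, n = len(reference), len(hypothesis)
    if m == 0:
        raise ValueError("reference must contain at least one word")
    # dist[i][j]: edit distance between reference[:i] and hypothesis[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i  # delete all i reference words
    for j in range(n + 1):
        dist[0][j] = j  # insert all j hypothesis words
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            substitution = dist[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            deletion = dist[i - 1][j] + 1
            insertion = dist[i][j - 1] + 1
            dist[i][j] = min(substitution, deletion, insertion)
    return dist[m][n] / m

reference = "distance education support systems using tv conferencing".split()
hypothesis = "distance education sport systems using conferencing".split()
wer = word_error_rate(reference, hypothesis)
print(wer)
```

In this example one substitution ("support" recognized as "sport") and one deletion ("tv") give two errors over seven reference words. Applying the same function to sequences restricted to query terms would give one possible reading of TER, although that restriction is an assumption of this note.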

In Table 2, the first line denotes results of the text-to-text retrieval, which
were relatively high compared with existing results reported in the NTCIR
workshops.

The remaining lines denote results of speech-driven text retrieval combined 
with the NTCIR-based language model (lines 2-5) and the newspaper-based 
model (lines 6-9), respectively. Here, "Mx" and "Fx" denote male and female
speakers, respectively. The following observations can be derived from these
results.

First, for both language models, results did not significantly change depend- 
ing on the speaker. The best average precision values for speech-driven text re- 
trieval were obtained with a combination of queries dictated by a male speaker 
(M1) and the NTCIR-based language model, which were approximately 80% of
those with the text-to-text retrieval. 

Second, by comparing results of different language models for each speaker,
one can see that the NTCIR-based model significantly decreased WER and TER
relative to the newspaper-based model, and that the retrieval method using
the NTCIR-based model significantly outperformed the one using the
newspaper-based model. In addition, these results were observable irrespective
of the speaker. Thus, we conclude that adapting language models based on target
collections was quite effective for speech-driven text retrieval.

Table 2. Results for different retrieval methods (AP: average precision, WER:
word error rate, TER: term error rate).

                       NTCIR-1                     NTCIR-2
Method         AP      WER     TER         AP      WER     TER
Text           0.3320  -       -           0.3118  -       -
M1 (NTCIR)     0.2708  0.1659  0.2190      0.2504  0.1532  0.2313
M2 (NTCIR)     0.2471  0.2034  0.2381      0.2114  0.2180  0.2799
F1 (NTCIR)     0.2276  0.1961  0.2857      0.1873  0.1885  0.2500
F2 (NTCIR)     0.2642  0.1477  0.2222      0.2376  0.1635  0.2388
M1 (News)      0.1076  0.3547  0.5143      0.0790  0.3594  0.5149
M2 (News)      0.1257  0.4044  0.5460      0.0691  0.5022  0.6343
F1 (News)      0.1156  0.3801  0.5238      0.0798  0.4418  0.5709
F2 (News)      0.1225  0.3317  0.5016      0.0917  0.4080  0.5858

Third, TER was generally higher than WER irrespective of the speaker.
In other words, speech recognition for content words was more difficult than
for function words, which were not contained in query terms.

We analyzed transcriptions for dictated queries, and found that speech recog- 
nition error was mainly caused by the out-of-vocabulary problem. In the case 
where major query terms are mistakenly recognized, the retrieval accuracy sub- 
stantially decreases. In addition, descriptions in the NTCIR topics often contain 
expressions which do not appear in the documents, such as "I want papers 
about..." Although these expressions usually do not affect the retrieval accu- 
racy, misrecognized words affect the recognition accuracy for remaining words 
including major query terms. Consequently, the retrieval accuracy decreases due 
to the partial misrecognition. 

Finally, we investigated the trade-off between recall and precision. Figures 3
and 4 show recall-precision curves of different retrieval methods, for the NTCIR-1
and 2 collections, respectively. In these figures, the relative superiority in
precision values due to different language models in Table 2 was also observable,
regardless of the recall.

However, the effectiveness of the on-line adaptation remains an open question 
and needs to be explored. 

4 Conclusion 

Aiming at speech-driven text retrieval with a high accuracy, we proposed a 
method to integrate speech recognition and text retrieval methods, in which 
target text collections are used to adapt statistical language models for speech
recognition. We also showed the effectiveness of our method by way of
experiments, where dictated information needs in the NTCIR collections were used as
queries to retrieve technical abstracts. Future work would include experiments
on various collections, such as newspaper articles and Web pages.

Fig. 3. Recall-precision curves for different retrieval methods using the NTCIR-1
collection.

Fig. 4. Recall-precision curves for different retrieval methods using the NTCIR-2
collection.